This quickstart guide explains how to join two tables A and B using the TF-IDF similarity measure. First, you need to import the required packages as follows (if you have installed py_stringsimjoin, the dependencies py_stringmatching and pandas will have been installed automatically):


In [1]:
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys

In [3]:
print('python version: ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)


python version: 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
py_stringsimjoin version: 0.1.0
py_stringmatching version: 0.2.0
pandas version: 0.16.2

Joining two tables using the TF-IDF measure typically consists of six steps:

  1. Loading the input tables
  2. Profiling the tables
  3. Creating a tokenizer
  4. Applying overlap filter
  5. Creating the corpus for TF-IDF matcher
  6. Applying the TF-IDF matcher

1. Loading the input tables

We begin by loading the two tables. For the purpose of this guide, we use the books dataset that comes with the package.


In [4]:
# Construct the paths of the tables to be loaded. Since we are loading a 
# dataset that ships with the package, we need to access the data from the 
# path where the package is installed. If you are loading your own data, 
# you can directly provide your table's path to the read_csv command.

table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_table_A.csv.gz'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_table_B.csv.gz'])

In [5]:
# Load csv files as dataframes. Since we are reading a compressed csv file, 
# we provide the compression argument. If you are reading an uncompressed 
# csv file, you should not specify the compression argument.

A = pd.read_csv(table_A_path, compression='gzip')
B = pd.read_csv(table_B_path, compression='gzip')
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))


Number of records in A: 3022
Number of records in B: 3099

In [6]:
A.head(1)


Out[6]:
 | ID | Title | Price | Author | ISBN13 | Publisher | Publication_Date | Pages | Dimensions
0 | BN0001 | The Maze Runner Series Complete Collection | $24.21 | James Dashner | 9780385388894 | Random House Children's Books | 7/8/14 | NaN | 5.60(w) x 8.40(h) x 3.30(d)

In [7]:
B.head(1)


Out[7]:
 | ID | Title | UsedPrice | NewPrice | Author | ISBN10 | ISBN13 | Publisher | Publication_Date | Pages | Dimensions
0 | HC0001 | The Hunger Games(Hunger Games (Quality)) | $4.98 | $11.02 | Suzanne Collins | 0439023521 | 9780439023528 | Scholastic Press | Jul. 3rd, 2010 | 374 | 5.30 x 8.00 x 0.90 inches

2. Profiling the tables

Before performing the join, we may want to profile the tables to understand the characteristics of their attributes. This can help identify:

a) unique attributes in a table, which can be used as key attributes when performing the join. A key attribute is needed to uniquely identify a tuple.

b) the number of missing values in each attribute. This can help you decide which attribute to perform the join on. For example, an attribute with many missing values may not be a good join attribute. Further, based on the missing value information, you need to decide how to handle missing values when performing the join.

You can profile the attributes in a table using the following command:


In [28]:
# profile attributes in table A
ssj.profile_table_for_join(A)


Out[28]:
Attribute | Unique values | Missing values | Comments
ID | 3022 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
Title | 2859 (94.61%) | 0 (0.0%) |
Price | 698 (23.1%) | 0 (0.0%) |
Author | 1225 (40.54%) | 0 (0.0%) |
ISBN13 | 2869 (94.94%) | 0 (0.0%) |
Publisher | 200 (6.62%) | 0 (0.0%) |
Publication_Date | 1100 (36.4%) | 0 (0.0%) |
Pages | 303 (10.03%) | 12 (0.4%) | Joining on this attribute will ignore 12 (0.4%) rows.
Dimensions | 1764 (58.37%) | 20 (0.66%) | Joining on this attribute will ignore 20 (0.66%) rows.

In [26]:
# profile attributes in table B
ssj.profile_table_for_join(B)


Out[26]:
Attribute | Unique values | Missing values | Comments
ID | 3099 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
Title | 2913 (94.0%) | 0 (0.0%) |
UsedPrice | 320 (10.33%) | 3 (0.1%) | Joining on this attribute will ignore 3 (0.1%) rows.
NewPrice | 883 (28.49%) | 166 (5.36%) | Joining on this attribute will ignore 166 (5.36%) rows.
Author | 1371 (44.24%) | 5 (0.16%) | Joining on this attribute will ignore 5 (0.16%) rows.
ISBN10 | 3099 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
ISBN13 | 3099 (100.0%) | 0 (0.0%) | This attribute can be used as a key attribute.
Publisher | 367 (11.84%) | 0 (0.0%) |
Publication_Date | 1165 (37.59%) | 0 (0.0%) |
Pages | 510 (16.46%) | 0 (0.0%) |
Dimensions | 2572 (82.99%) | 248 (8.0%) | Joining on this attribute will ignore 248 (8.0%) rows.

Based on the profile output, we find that the 'Title' attribute in both tables contains no missing values. Hence, for the purpose of this guide, we will join tables A and B on the 'Title' attribute using the TF-IDF measure. Next, we need to decide what threshold to use for the join. For this guide, we will use a threshold of 0.5. Specifically, the join will find tuple pairs from A and B such that the TF-IDF score over the 'Title' attributes is at least 0.5.

Naively, performing the join would involve enumerating the Cartesian product A x B (3022 x 3099 = 9,365,178 pairs) and computing the TF-IDF score for every pair. But this can be very time consuming. Hence, we optimize by first applying an overlap filter over tables A and B to find pairs sharing at least one token in the 'Title' attribute. The intuition here is that for the TF-IDF score to be above zero, there must be at least one common token between the attributes, so pairs with no common token can be pruned safely (see the sketch below). Finally, we apply the TF-IDF measure over the candidate pairs to obtain the join output.
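
To make this intuition concrete, here is a minimal sketch (plain Python, not part of the package API) showing that the pruning check is just token overlap:


In [ ]:
# Illustration of the pruning intuition: if two tokenized titles share no
# tokens, their TF-IDF score is necessarily zero, so the pair can be
# dropped without computing the score.
tokens_1 = set('The Maze Runner Series Complete Collection'.split())
tokens_2 = set('Looking for Alaska'.split())
print(len(tokens_1 & tokens_2) > 0)  # False -> TF-IDF score must be 0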

3. Creating a tokenizer

Since the TF-IDF measure treats input strings as bags of tokens, we need to select a tokenizer to convert each string into a bag of tokens. Currently, we support tokenizers from the py_stringmatching package, which provides five different tokenizer types: alphabetical tokenizer, alphanumeric tokenizer, delimiter-based tokenizer, qgram tokenizer, and whitespace tokenizer.

For the purpose of this guide, we will use a whitespace tokenizer. Once we have selected a tokenizer type, we need to create a tokenizer object as shown below:


In [8]:
# create whitespace tokenizer for tokenizing 'Title' attribute
ws = sm.WhitespaceTokenizer()

ws.tokenize('The Maze Runner Series Complete Collection')


Out[8]:
['The', 'Maze', 'Runner', 'Series', 'Complete', 'Collection']
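
The other tokenizer types are created in the same way. For example, a qgram tokenizer (shown here only as an illustrative alternative; it splits a string into overlapping substrings of length q) can be created as follows:


In [ ]:
# Alternative: a qgram tokenizer with q=3, useful when word boundaries
# are noisy (e.g., misspellings or missing spaces)
qg3 = sm.QgramTokenizer(qval=3)
qg3.tokenize('Maze')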

4. Applying overlap filter

Next, we need to create an overlap filter and apply it over tables A and B to obtain a candidate set of pairs C.

In [9]:
# create overlap filter with whitespace tokenizer and threshold of 1. 
of = ssj.OverlapFilter(ws, 1) 

# apply overlap filter to tables A and B to find tuple pairs 
# sharing at least 1 token in Title attribute
C = of.filter_tables(A, B, 'ID', 'ID', 'Title', 'Title', n_jobs=-1)


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 0.432 sec

In [10]:
len(C)


Out[10]:
1103242

In [11]:
C.head(5)


Out[11]:
_id l_ID r_ID
0 0 BN0001 HC0001
1 1 BN0002 HC0001
2 2 BN0003 HC0001
3 3 BN0007 HC0001
4 4 BN0008 HC0001

If you want to include pairs with missing values in the output, you need to set the allow_missing flag to True when creating the overlap filter, as shown below:


In [ ]:
of = ssj.OverlapFilter(ws, 1, allow_missing=True)

Now, when you apply the filter, pairs with missing values will also be included in the output.
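
For instance, the missing-value-tolerant filter can be applied with the same filter_tables call as before (a sketch; in this dataset 'Title' has no missing values, so the output should match C):


In [ ]:
# Apply the filter that retains pairs with missing values. A pair where
# either 'Title' is missing cannot be pruned safely (its token overlap
# is unknown), so it is passed through to the candidate set.
of_with_missing = ssj.OverlapFilter(ws, 1, allow_missing=True)
C_with_missing = of_with_missing.filter_tables(A, B, 'ID', 'ID',
                                               'Title', 'Title', n_jobs=-1)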

5. Creating the corpus for TF-IDF matcher

The next step is to create the corpus required for the TF-IDF measure. Specifically, the corpus consists of the token lists of the 'Title' attribute values from both tables. The corpus can be created as follows:


In [12]:
# create a list of tokens
A_tokens = A['Title'].apply(ws.tokenize).tolist()
B_tokens = B['Title'].apply(ws.tokenize).tolist()

# merge both the lists of tokens to create the corpus
corpus = A_tokens + B_tokens
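
Each corpus entry is the token list of one title. You can verify that the corpus covers both tables (len(A) + len(B) = 6121 lists) and inspect an entry:


In [ ]:
# Inspect the corpus: one token list per title, first those of A, then
# those of B; corpus[0] holds the tokens of A's first title
print(len(corpus))
corpus[0]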

6. Applying the TF-IDF matcher

Finally, you need to create and apply the TF-IDF matcher as shown below:


In [13]:
# create tf-idf object with the generated corpus
tfidf = sm.TfIdf(corpus, dampen=True)
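
Before applying the matcher to all candidate pairs, you can sanity-check the measure on a single pair. get_sim_score takes two bags of tokens and returns the normalized TF-IDF score; for the pair below it should agree with the _sim_score reported in the final output (0.510586):


In [ ]:
# Sanity check: TF-IDF similarity of two tokenized titles. Identical
# token bags score 1.0; bags with no common token score 0.0.
tfidf.get_sim_score(ws.tokenize('Thief of Always'),
                    ws.tokenize('The Book Thief'))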

In [14]:
# apply the matcher with a threshold of 0.5. This will find pairs from C 
# with TF-IDF score >= 0.5. Setting n_jobs=-1 exploits all CPU cores available.

output_pairs = ssj.apply_matcher(C, 'l_ID', 'r_ID', A, B, 'ID', 'ID', 'Title', 'Title',
                                 ws, tfidf.get_sim_score, 0.5, 
                                 l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 16.146 sec

In [15]:
len(output_pairs)


Out[15]:
1222

In [16]:
output_pairs.head()


Out[16]:
 | _id | l_ID | r_ID | l_Title | r_Title | _sim_score
0 | 33 | BN0070 | HC0001 | The Hunger Games (Hunger Games Series #1) | The Hunger Games(Hunger Games (Quality)) | 0.565250
1 | 1665 | BN0019 | HC0003 | Looking for Alaska | Looking for Alaska | 1.000000
2 | 1885 | BN0043 | HC0006 | The Book Thief | The Book Thief | 1.000000
3 | 2020 | BN0319 | HC0006 | Thief of Always | The Book Thief | 0.510586
4 | 2449 | BN1295 | HC0006 | Midnight Thief (Midnight Thief Series #1) | The Book Thief | 0.512195

If you want to include pairs with missing values in the output, you need to set the allow_missing flag to True in the apply_matcher method.
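
For example (a sketch; note that missing-value pairs can only appear in the output if the candidate set was also built with allow_missing=True, as in the earlier filter sketch):


In [ ]:
# Re-run the matcher on the missing-value-tolerant candidate set,
# keeping pairs whose 'Title' is missing. Such pairs cannot be scored,
# so they are passed through to the output.
output_pairs_with_missing = ssj.apply_matcher(
    C_with_missing, 'l_ID', 'r_ID', A, B, 'ID', 'ID', 'Title', 'Title',
    ws, tfidf.get_sim_score, 0.5, allow_missing=True,
    l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)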